Abstract

Inference in general Markov random fields 
(MRFs) is NP-hard, though identifying the 
maximum a posteriori (MAP) configuration 
of pairwise MRFs with submodular cost functions
is efficiently solvable using graph cuts.
Marginal inference, however, even for this re- 
stricted class, is in #P. We prove new for- 
mulations of derivatives of the Bethe free en- 
ergy, provide bounds on the derivatives and 
bracket the locations of stationary points, 
introducing a new technique called Bethe 
bound propagation. Several results apply 
to pairwise models whether associative or 
not. Applying these to discretized pseudo- 
marginals in the associative case we present 
a polynomial time approximation scheme for 
global optimization provided the maximum
degree is $O(\log n)$, and discuss several extensions.



1 Introduction 

Markov random fields are fundamental tools in ma- 
chine learning with broad application in areas includ- 
ing computer vision, speech recognition and computa- 
tional biology. Two forms of inference are commonly 
employed: maximum a posteriori (MAP), where the 
most likely configuration is returned; and marginal, 
where the marginal probability distributions for each 
set of variables with a linking potential function are 
returned. In general, MAP inference is NP-hard [17] 
and marginal inference, even for pairwise models, is 
harder still in #P (2U [21 14] . 

An important class of MRFs, those with only unary 
and pairwise submodular cost functions, admits effi- 
cient MAP inference. This was first shown for binary 
models [5] and applied broadly in computer vision [T] ,
where the graph cuts method is particularly effective 
[22] . Recent work extended the application of this 
approach to multi-label submodular energies of up to 
third order [141 116) . Yet marginal inference, even for 
binary pairwise models, is intractable with few known 
exceptions. Belief propagation is efficient (and exact) 
for trees, and loopy belief propagation is guaranteed 
to converge when the topology has one cycle [25] . 

Applying the same framework to general models, 
termed loopy belief propagation (LBP), has proved re- 
markably effective in some situations but fails in others 
and has no general guarantees on convergence. A key 
result is that belief propagation (BP) fixed points co- 
incide with stationary points of the Bethe variational 
problem [27]. Stationary points, however, may not
identify the global optimum of the Bethe free energy.
Subsequently, it was shown that all stable
BP fixed points are local optima (rather
than saddle points) of this problem, but not vice versa
[5J US]- Variational methods demonstrate that min- 
imizing the Bethe free energy should deliver a good 
approximation to the true marginal distribution and 
recently [TS] proved that for submodular MRFs, the 
Bethe optimum is an upper bound on the true free en- 
ergy and thus yields a desirable lower bound on the 
partition function. 

Marginal inference is a crucial problem in probabilis- 
tic systems. A noteworthy example is the Quick 
Medical Reference (QMR) problem [20], a graphical 
model involving 600 diseases and 4000 possible find- 
ings. Therein, medical diagnostics are performed by 
computing the posterior marginal probability of each 
disease given a set of possible findings. The marginal 
distribution over the presence of a disease must often 
be precisely estimated in order to determine the course 
of medical treatment. Thus, we seek the probability 
that a patient suffers from a condition, rather than the 
MAP estimate, which could be very different. 

Marginal inference also arises during learning or pa- 
rameter estimation in Markov random fields. For in- 
stance, computing the gradients of a partition func- 
tion in a maximum likelihood estimation procedure 
is equivalent to marginal inference. In learning prob-
lems, the intractability of the marginal inference prob- 
lem requires the exploration of marginal approxima- 
tion schemes [BJ. However, in the general case, both 
exact marginal inference and approximate marginal in- 
ference are NP-hard [3J [3] . 

1.1 Contribution 

We derive various properties of the Bethe free en- 
ergy and apply them to discretized pseudo-marginals 
to prove a polynomial-time approximation scheme 
(PTAS) for the global minimum of the Bethe free en- 
ergy for binary pairwise associative MRFs. 

The idea is that if we can find the optimal discretized
point on a sufficiently fine mesh that covers all possible
locations of an optimum point within a distance of $\delta$,
then we can bound the difference to the optimum by
$\frac{1}{2}\Lambda\delta^2$, where $\Lambda$ is the greatest directional second derivative.
To our knowledge, we present the first rigorous
bounds on $\Lambda$. One reason this is difficult is that derivatives
tend to infinity as singleton marginals approach
the boundary cases of 0 or 1. Hence we need to prove
bounds on the location away from these edges.

We first prove various bounds including on the loca- 
tion of any stationary point of the Bethe free energy, 
as well as on the true marginals. In doing this we 
develop Bethe bound propagation (BBP) which some- 
times produces remarkably tight bounds by itself. We 
then consider the second derivatives with a view to
bounding $\Lambda$. Additional analysis allows us to prove
that the discretized multi-label problem is submodular
on any mesh and hence the discretized optimum
can be found efficiently using graph cuts [16].

Various extensions are discussed in the closing section, 
including applications to non-associative models, to 
models that are themselves multi-label and to mod- 
els with higher order terms. 

1.2 Related work 

A variety of heuristics have been proposed for marginal 
inference problems. Marginal inference in the QMR 
medical diagnostic problem has been explored with 
Markov Chain Monte Carlo (MCMC) [US El [3] meth- 
ods, variational methods [TT], and search methods 
[5]. Many of these heuristics are restricted to certain 
classes of graphical model (such as QMR). Here we 
explore another approach to approximate marginal in- 
ference by minimizing the Bethe free energy. 

The minimization of the Bethe free energy is often approached
using loopy belief propagation. However,
there are few guarantees on the rate of convergence
of LBP, which prevents it from functioning as a PTAS
for Bethe minimization [24]. An important contribution
[26] showed that the Bethe free energy of a binary
pairwise MRF may be considered as a function only of
the singleton marginals; however, this connection was
provided without convergence results.

A PTAS was recently proposed [18] for the location of
a point whose derivative of the Bethe free energy has
magnitude less than $\epsilon$. However, this identifies only
an approximately stationary point (which may not be
even a local minimum) that could be arbitrarily far
from the global optimum. That result applies for a
general binary pairwise MRF subject to an edge sparsity
restriction that the maximum degree is $O(\log n)$.
Here we primarily focus on associative models with the
same degree restriction, but our deliverable not only
satisfies the property in [18] but importantly is also
guaranteed to have Bethe free energy within $\epsilon$ of the
optimum.

We note that the PTAS in [18] may provide the global 
optimum when the fixed point is unique and recent 
work [25] has enumerated necessary and sufficient conditions
for uniqueness. Nevertheless, aside from these
restricted settings, there are no prior polynomial-time 
methods for finding or rigorously approximating the 
global minimum of the Bethe free energy. Earlier 
work considered discretizations of pseudo-marginals 
but presented incomplete results [12]. We go signifi- 
cantly further in deriving additional key results which 
together admit the PTAS. These include explicit forms 
and bounds on the second derivatives, on the third 
derivatives and on the locations of stationary points. 

2 Preliminaries & Notation 

We focus on a binary pairwise MRF over $n$ variables
$X_1, \dots, X_n$ with topology $(\mathcal{V}, \mathcal{E})$ and generally follow
the notation of [26]. We assume^1

$$p(x) = \frac{e^{-E(x)}}{Z}, \qquad E = -\sum_{i \in \mathcal{V}} \theta_i x_i - \sum_{(i,j) \in \mathcal{E}} W_{ij} x_i x_j \tag{1}$$

where the partition function $Z = \sum_x e^{-E(x)}$ is a normalizing
constant. Let $F$ be the Bethe free energy, so
$F = E - S$ where $S$ is the Bethe approximation to the
true entropy, $S = \sum_{(i,j) \in \mathcal{E}} S_{ij} + \sum_{i \in \mathcal{V}} (1 - z_i) S_i$. $S_{ij}$
is the entropy of a pseudo-marginal of $(X_i, X_j)$ on the
local polytope, $S_i$ is the entropy of the singleton distribution
and $z_i$ is the degree of $i$, that is the number
of variables to which $X_i$ is adjacent. We assume the
model is connected so all $z_i \ge 1$. For each node $i$ define
sums of positive and negative incident edge weights:

$$W_i = \sum_{j \in N(i): W_{ij} > 0} W_{ij}, \qquad V_i = -\sum_{j \in N(i): W_{ij} < 0} W_{ij}$$

where $N(i)$ indicates the neighbors of node $i$. For a
pseudo-marginal distribution $q$, let $q_i = p(X_i = 1)$.

1 The energy $E$ can always be thus reparameterized with
finite $\theta_i$ and $W_{ij}$ terms provided $p(x) > 0 \; \forall x$. There are
reasonable distributions where this does not hold, i.e. $\exists x$ :
$p(x) = 0$, but this can often be handled by assigning such
configurations a sufficiently small positive probability $\epsilon$.

Consistency and normalization constraints from the 
local polytope imply 



$$\mu_{ij} = \begin{pmatrix} 1 + \xi_{ij} - q_i - q_j & q_j - \xi_{ij} \\ q_i - \xi_{ij} & \xi_{ij} \end{pmatrix} \tag{2}$$

for some $\xi_{ij} \in [0, \min(q_i, q_j)]$, where $\mu_{ij}(a, b) = p(X_i = a, X_j = b)$
is the pairwise marginal. Let $\alpha_{ij} = e^{W_{ij}} - 1$.
The case $\alpha_{ij} = 0 \Leftrightarrow W_{ij} = 0$ may be assumed not to occur, else
the edge may be deleted. $\alpha_{ij}$ has the same sign
as $W_{ij}$; if positive then the edge $(i, j)$ is associative; if
negative then the edge is repulsive. The MRF is associative
if all edges are associative. As in [26], one can
solve for $\xi_{ij}$ explicitly in terms of $q_i$ and $q_j$ by minimizing
the free energy, leading to a quadratic equation
with real roots

$$\alpha_{ij} \xi_{ij}^2 - [1 + \alpha_{ij}(q_i + q_j)] \xi_{ij} + (1 + \alpha_{ij}) q_i q_j = 0. \tag{3}$$

For $\alpha_{ij} > 0$, $\xi_{ij}(q_i, q_j)$ is the lower root, for $\alpha_{ij} < 0$
it is the higher. Notice that when $\alpha_{ij} = 0$ (no edge
relationship) this reduces as expected to $\xi_{ij} = p(X_i = 1, X_j = 1) = p(X_i = 1) p(X_j = 1) = q_i q_j$.
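
As an illustration of (3), the following is a minimal sketch (not taken from the paper) that computes $\xi_{ij}$ numerically. The function name `xi` and its argument names are assumptions made for this example. It uses the observation that, for either sign of $\alpha_{ij}$, the required root can be written with a minus sign in front of the square root.

```python
import math

def xi(qi, qj, alpha):
    """Return the root of alpha*x^2 - (1 + alpha*(qi+qj))*x + (1+alpha)*qi*qj = 0
    selected as in the text: lower root if alpha > 0, higher root if alpha < 0."""
    if alpha == 0.0:
        # No edge relationship: the variables are independent.
        return qi * qj
    b = 1.0 + alpha * (qi + qj)
    disc = b * b - 4.0 * alpha * (1.0 + alpha) * qi * qj
    root = math.sqrt(max(disc, 0.0))  # guard against tiny negative rounding
    # (b - root) / (2*alpha) is the lower root when alpha > 0 and,
    # because dividing by a negative number reverses order, the higher
    # root when alpha < 0.
    return (b - root) / (2.0 * alpha)
```

For example, `xi(0.3, 0.6, math.e - 1)` lies between $q_i q_j = 0.18$ and $\min(q_i, q_j) = 0.3$, as expected for an associative edge.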

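$S_{ij}$ is the entropy of $\mu_{ij}(q_i, q_j)$. Hence

$$F(q) = \sum_{(i,j) \in \mathcal{E}} -\big(W_{ij} \xi_{ij} + S_{ij}(q_i, q_j)\big) + \sum_{i \in \mathcal{V}} \big(-\theta_i q_i + (z_i - 1) S_i(q_i)\big). \tag{4}$$

Collecting the pairwise terms for one edge, define

$$f_{ij}(q_i, q_j) = -W_{ij} \xi_{ij}(q_i, q_j) - S_{ij}(q_i, q_j). \tag{5}$$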

We are interested in discretized pseudo-marginals
where for each $q_i$ we restrict its possible values to a
discrete set $\mathcal{D}_i$ of points in $[0, 1]$. Note we may often
have $\mathcal{D}_i \ne \mathcal{D}_j$. Let $\mathcal{D} = \prod_{i \in \mathcal{V}} \mathcal{D}_i$.

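In [315], the first partial derivative of the Bethe free
energy is derived as

$$\frac{\partial F}{\partial q_i} = -\theta_i + \log Q_i, \quad \text{where } Q_i = \frac{(1 - q_i)^{z_i - 1}}{q_i^{z_i - 1}} \cdot \frac{\prod_{j \in N(i)} (q_i - \xi_{ij})}{\prod_{j \in N(i)} (1 + \xi_{ij} - q_i - q_j)}. \tag{6}$$

Recall the sigmoid function $\sigma(x) = 1/(1 + \exp(-x))$
which will be used for Bethe bounds. We write $A_i$ for
the lower bound of $q_i$ and $B_i$ for the lower bound of
$1 - q_i$ so $A_i \le q_i \le 1 - B_i$. Define $\eta_i = \min(A_i, B_i)$.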



2.1 Submodularity 

In our context, a pairwise multi-label function on a
set of ordered labels $X_{ij} = \{1, \dots, K_i\} \times \{1, \dots, K_j\}$
is submodular if

$$\forall x, y \in X_{ij}, \quad f(x \wedge y) + f(x \vee y) \le f(x) + f(y) \tag{7}$$

where for $x = (x_1, x_2)$ and $y = (y_1, y_2)$, $(x \wedge y) = (\min(x_1, y_1), \min(x_2, y_2))$
and $(x \vee y) = (\max(x_1, y_1), \max(x_2, y_2))$. For binary variables this
is equivalent to associativity.

The key property for us is that if all the pairwise cost
functions $f_{ij}$ over $\mathcal{D}_i \times \mathcal{D}_j$ from (5) are submodular
then the global discretized optimum may be found efficiently
as a multi-label MAP inference problem using
graph cuts [16].
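
Condition (7) can be checked mechanically for a cost table over two ordered label sets. The sketch below is illustrative only; the helper name `is_submodular` and the list-of-lists table layout are assumptions. It relies on the standard fact that, for a two-dimensional table over ordered labels, (7) holds for all pairs if and only if it holds on every block of adjacent labels.

```python
def is_submodular(cost, tol=1e-12):
    """cost[a][b] is the pairwise cost for ordered labels a and b.
    Checks f(x ^ y) + f(x v y) <= f(x) + f(y) on every adjacent 2x2 block,
    which is sufficient for the full condition on a 2-D ordered table."""
    rows, cols = len(cost), len(cost[0])
    for a in range(rows - 1):
        for b in range(cols - 1):
            meet_join = cost[a][b] + cost[a + 1][b + 1]
            cross = cost[a][b + 1] + cost[a + 1][b]
            if meet_join > cross + tol:
                return False
    return True
```

For binary labels the table `[[0, 1], [1, 0]]` (agreeing states are cheaper) passes, while `[[1, 0], [0, 1]]` does not, matching the equivalence with associativity noted above.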

3 Bounds & Bethe bound propagation

We use the technique of flipping variables, i.e. considering
$Y_i = 1 - X_i$. Flipping a variable flips the
parity of all its incident edges so associative $\leftrightarrow$ repulsive.
Flipping both ends of an edge leaves its parity
unchanged.

3.1 Flipping all variables 

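Consider a new model with variables $\{Y_i = 1 - X_i, i = 1, \dots, n\}$
and the same edges. Instead of $\theta_i$s and $W_{ij}$s,
let the new model have parameters $\phi_i$ and $V_{ij}$. We
identify values such that the energies of all states are
maintained up to a constant^2:

$$E = -\sum_{i \in \mathcal{V}} \theta_i X_i - \sum_{(i,j) \in \mathcal{E}} W_{ij} X_i X_j = \text{const} - \sum_{i \in \mathcal{V}} \phi_i (1 - X_i) - \sum_{(i,j) \in \mathcal{E}} V_{ij} (1 - X_i)(1 - X_j).$$

Matching coefficients yields

$$V_{ij} = W_{ij}, \qquad \phi_i = -\theta_i - \sum_{j \in N(i)} W_{ij} = -\theta_i - W_i + V_i. \tag{8}$$

If the original model was associative, so too is the new.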
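
The coefficient matching can be verified by brute force on a small model. The sketch below is an illustration under stated assumptions: the names `energy` and `flip_all` and the dict-of-edges representation are hypothetical, not from the paper. It builds the flipped parameters and lets one confirm that the two energies differ by the same constant for every state.

```python
def energy(theta, W, x):
    """E(x) = -sum_i theta_i x_i - sum_{(i,j)} W_ij x_i x_j, with W a dict
    mapping edge tuples (i, j) to weights."""
    e = -sum(t * v for t, v in zip(theta, x))
    e -= sum(w * x[i] * x[j] for (i, j), w in W.items())
    return e

def flip_all(theta, W):
    """Parameters of the model over Y_i = 1 - X_i:
    V_ij = W_ij and phi_i = -theta_i - sum_{j in N(i)} W_ij."""
    phi = [-t for t in theta]
    for (i, j), w in W.items():
        phi[i] -= w
        phi[j] -= w
    return phi, dict(W)
```

Evaluating `energy(theta, W, x) - energy(phi, V, y)` with `y = 1 - x` over all states of a three-node chain gives a single constant value, which is the constant absorbed into the partition function.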

3.2 Flipping some variables 

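Sometimes we flip only a subset $\mathcal{R} \subseteq \mathcal{V}$ of the variables.
This can be useful, for example, to make the model
locally associative around a variable, which can always
be achieved by flipping just those neighbors to which
it has a repulsive edge. Let $Y_i = 1 - X_i$ if $i \in \mathcal{R}$, else
$Y_i = X_i$ for $i \in \mathcal{S}$, where $\mathcal{S} = \mathcal{V} \setminus \mathcal{R}$. Let $\mathcal{E}_t$ = {edges
with exactly $t$ ends in $\mathcal{R}$} for $t = 0, 1, 2$.

As in 3.1, solving for $V_{ij}$ and $\phi_i$ such that energies are
unchanged up to a constant,

$$V_{ij} = \begin{cases} W_{ij} & (i,j) \in \mathcal{E}_0 \cup \mathcal{E}_2 \\ -W_{ij} & (i,j) \in \mathcal{E}_1 \end{cases}, \qquad \phi_i = \begin{cases} \theta_i + \sum_{(i,j) \in \mathcal{E}_1} W_{ij} & i \in \mathcal{S} \\ -\theta_i - \sum_{(i,j) \in \mathcal{E}_2} W_{ij} & i \in \mathcal{R} \end{cases} \tag{9}$$

2 Any constant difference will be absorbed into the partition
function and leave probabilities unchanged.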

Lemma 1. Flipping any set of variables changes affected
pseudo-marginal matrix entries' locations but
not values. The Bethe free energy is unchanged up
to a constant, hence the locations of stationary points
are unaffected.

Proof. By construction energies are the same up to a
constant. The singleton entropies are symmetric functions
of $q_i$ and $1 - q_i$ so are unaffected. The impact on
pseudo-marginal matrix entries follows directly from
definitions. Thus Bethe entropy is unaffected. □



3.3 Bounds

We derive several results that are useful in bounding
the Bethe free energy as well as the marginals.

Lemma 2. $\alpha_{ij} > 0 \Rightarrow \xi_{ij} \ge q_i q_j$, $\alpha_{ij} < 0 \Rightarrow \xi_{ij} \le q_i q_j$.

Proof. The quadratic equation (3) for $\xi_{ij}$ may be
rewritten $\xi_{ij} - q_i q_j = \alpha_{ij} (q_i - \xi_{ij})(q_j - \xi_{ij})$. Both terms
in parentheses on the right are elements of the pseudo-marginal
matrix $\mu_{ij}$ so are constrained to be $\ge 0$. □

This simple result is sufficient to bound the location of
stationary points of the Bethe free energy away from
the edges of 0 and 1, though we improve the bounds
in Lemma 6.

Theorem 3. If all edges incident to $X_i$ are associative
then at any stationary point of the Bethe free energy,
$\sigma(\theta_i) \le q_i \le \sigma(\theta_i + W_i)$. Remark exactly the same
sandwich result holds for the true marginal $p_i$.

Proof. We first prove the left inequality. Consider (6);
at a stationary point $\log Q_i = \theta_i$.
Using $\alpha_{ij} > 0 \; \forall j \in N(i)$ and Lemma 2 we have

$$e^{\theta_i} = Q_i = \frac{\prod_{j \in N(i)} (q_i - \xi_{ij})}{\prod_{j \in N(i)} (1 + \xi_{ij} - q_i - q_j)} \left(\frac{1 - q_i}{q_i}\right)^{z_i - 1} \le \frac{\prod_{j \in N(i)} q_i (1 - q_j)}{\prod_{j \in N(i)} (1 - q_i)(1 - q_j)} \left(\frac{1 - q_i}{q_i}\right)^{z_i - 1} = \frac{q_i}{1 - q_i}$$

which gives the result.

To obtain the right inequality, flip all variables as in
section 3.1. Using the first inequality, (8) and Lemma 1
yields $1 - q_i \ge \sigma(-\theta_i - W_i) \Leftrightarrow q_i \le \sigma(\theta_i + W_i)$
since $1 - \sigma(-x) = \sigma(x)$. To show the result for the
true marginal, let $m_{i=a} = \sum_{x: x_i = a} \exp\big(\sum_{i \in \mathcal{V}} \theta_i x_i + \sum_{(i,j) \in \mathcal{E}} W_{ij} x_i x_j\big)$,
then using (1), $p_i = \frac{m_{i=1}}{m_{i=1} + m_{i=0}}$.
Since all $W_{ij} > 0$ the result follows. □

Using (9) we obtain a more powerful corollary.

Theorem 4. For general edge types (associative
or repulsive), let $W_i = \sum_{j \in N(i): W_{ij} > 0} W_{ij}$, $V_i = -\sum_{j \in N(i): W_{ij} < 0} W_{ij}$.
At any stationary point of the
Bethe free energy, $\sigma(\theta_i - V_i) \le q_i \le \sigma(\theta_i + W_i)$. The
same sandwich result holds for the true marginal $p_i$.

Proof. Using (9), flip all variables adjacent to $X_i$ with
a repulsive edge, i.e. set $\mathcal{R} = \{j \in N(i) : W_{ij} < 0\}$.
The resulting new model is fully associative around $X_i$
so we may apply Theorem 3 to yield the result. □

The following lemma will be useful.

Lemma 5. For $q_i, q_j \in [0, 1]$, $0 \le q_i + q_j - 2 q_i q_j \le 1$.

Proof. Let $f = q_i + q_j - 2 q_i q_j$. To show the left inequality,
consider $m = \min(q_i, q_j)$ and $M = \max(q_i, q_j)$,
then $f \ge 2m(1 - M) \ge 0$. For the right inequality
observe $1 - f = (1 - q_i)(1 - q_j) + q_i q_j \ge 0$. □

Lemma 6 (Better lower bound for $\xi_{ij}$). If $\alpha_{ij} > 0$,
then $\xi_{ij} \ge q_i q_j + \alpha_{ij} q_i q_j (1 - q_i)(1 - q_j) / [1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)]$,
equality only possible at an edge, i.e. one
or both of $q_i, q_j \in \{0, 1\}$.

Proof. Write $\xi_{ij} = q_i q_j + y$ and substitute into (3),

$$\alpha_{ij} y^2 - y[1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)] + \alpha_{ij} q_i q_j (1 - q_i)(1 - q_j) = 0.$$

We have a convex parabola which at $y = 0$ is above
the abscissa (unless $q_i$ or $q_j \in \{0, 1\}$) and has negative
gradient by Lemma 5. Hence all roots are at $y \ge 0$ and
given convexity we can bound below using the tangent
at $y = 0$ which yields the result. □

Lemma 7 (Upper bound for $\xi_{ij}$). If $\alpha_{ij} > 0$, then

$$q_j - \xi_{ij} \ge \frac{q_j (1 - q_i)}{1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)} \ge \frac{q_j (1 - q_i)}{1 + \alpha_{ij}},$$

$$q_i - \xi_{ij} \ge \frac{q_i (1 - q_j)}{1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)} \ge \frac{q_i (1 - q_j)}{1 + \alpha_{ij}}.$$

Also $\xi_{ij} \le m(\alpha_{ij} + M)/(1 + \alpha_{ij}) \Rightarrow \xi_{ij} - q_i q_j \le \frac{\alpha_{ij} m (1 - M)}{1 + \alpha_{ij}}$,
where $m = \min(q_i, q_j)$ and $M = \max(q_i, q_j)$.

Proof. We prove the first inequality. The second follows
by Lemma 5 and those for $q_i - \xi_{ij}$ follow by symmetry.
The final inequality follows by combining the
earlier ones. Let $\xi_{ij} = q_j + y$ and substitute into (3),

$$\alpha_{ij} y^2 + y[\alpha_{ij}(q_j - q_i) - 1] + q_j (q_i - 1) = 0.$$

The function is a convex parabola which at $y = 0$ is
at $q_j(q_i - 1) \le 0$.^3 From Lemma 2 we know that the
left root is at $\xi_{ij} \ge q_i q_j$ so we may take the derivative
there, i.e. at $q_j + y = q_i q_j \Leftrightarrow y = q_j(q_i - 1)$, and
by convexity use this to establish a lower bound for
$q_j - \xi_{ij}$. That derivative is

$$2 \alpha_{ij} q_i q_j - 2 \alpha_{ij} q_j + \alpha_{ij} q_j - \alpha_{ij} q_i - 1 = -[1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)].$$

□
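
Lemmas 6 and 7 together bracket $\xi_{ij}$ for an associative edge. The following is a small numerical sketch of that bracket, given as an illustration rather than as part of the original analysis; the function names are assumptions, and `xi` re-solves (3) for the lower root so that the block is self-contained.

```python
import math

def xi(qi, qj, alpha):
    """Lower root of (3); valid here for alpha > 0 only."""
    b = 1.0 + alpha * (qi + qj)
    disc = b * b - 4.0 * alpha * (1.0 + alpha) * qi * qj
    return (b - math.sqrt(max(disc, 0.0))) / (2.0 * alpha)

def xi_lower(qi, qj, alpha):
    """Lemma 6 lower bound: tangent to the parabola at y = 0."""
    denom = 1.0 + alpha * (qi + qj - 2.0 * qi * qj)
    return qi * qj + alpha * qi * qj * (1.0 - qi) * (1.0 - qj) / denom

def xi_upper(qi, qj, alpha):
    """Lemma 7 upper bound: m * (alpha + M) / (1 + alpha)."""
    m, M = min(qi, qj), max(qi, qj)
    return m * (alpha + M) / (1.0 + alpha)
```

On a grid of interior points one can confirm `xi_lower(...) <= xi(...) <= xi_upper(...)`, with the gap shrinking as $\alpha_{ij} \to 0$.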



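Lemma 8. Unless $q_i$ or $q_j \in \{0, 1\}$, all entries of the
pseudo-marginal $\mu_{ij}$ are strictly $> 0$, whether $(i, j)$ is
associative or repulsive.^4

Proof. First assume $\alpha_{ij} > 0$. Considering (2) and using
Lemmas 2 and 7 we have that element-wise

$$\mu_{ij} \ge \begin{pmatrix} (1 - q_i)(1 - q_j) & \frac{q_j (1 - q_i)}{1 + \alpha_{ij}} \\ \frac{q_i (1 - q_j)}{1 + \alpha_{ij}} & q_i q_j \end{pmatrix} \tag{10}$$

which proves the result for this case. If $\alpha_{ij} < 0$ then
flip either $q_i$ or $q_j$. As in the proof of Lemma 1, pseudo-marginal
entries change position but not value. □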

3.4 Bethe bound propagation (BBP) 

We have already derived bounds on stationary points
in Theorems 3 and 4. Here we show for variables with
only associative edges how we can iteratively improve
these bounds, sometimes with striking results. Note
that a fully associative model is not required, and as
in section 3.2 any model may be selectively flipped to
yield local associativity around a particular node.

We first assume all $\alpha_{ij} > 0$ and adopt the approach of
Theorem 3, now using the better bound from Lemma 6
to obtain

$$q_i - \xi_{ij} \le q_i - q_i q_j - \frac{\alpha_{ij} q_i q_j (1 - q_i)(1 - q_j)}{1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)} = q_i (1 - q_j) \, \frac{1 + \alpha_{ij} q_i (1 - q_j)}{1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)},$$

$$1 + \xi_{ij} - q_i - q_j \ge (1 - q_i)(1 - q_j) + \frac{\alpha_{ij} q_i q_j (1 - q_i)(1 - q_j)}{1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)} = (1 - q_i)(1 - q_j) \, \frac{1 + \alpha_{ij}(q_i + q_j - q_i q_j)}{1 + \alpha_{ij}(q_i + q_j - 2 q_i q_j)}.$$

Hence $Q_i \le \frac{q_i}{1 - q_i} \prod_{j \in N(i)} R_{ij}^{-1}$ where

$$R_{ij} = \frac{1 + \alpha_{ij}(q_i + q_j - q_i q_j)}{1 + \alpha_{ij} q_i (1 - q_j)} = 1 + \frac{\alpha_{ij} q_j}{1 + \alpha_{ij} q_i (1 - q_j)}$$

is monotonically increasing with $q_j$ and decreasing with
$q_i$. Hence

$$e^{W_{ij}} = 1 + \alpha_{ij} \ge R_{ij} \ge L_{ij} := 1 + \frac{\alpha_{ij} A_j}{1 + \alpha_{ij}(1 - B_i)(1 - A_j)}. \tag{11}$$

3 This confirms neatly that we must take the left root
else $y > 0 \Rightarrow \mu_{01} < 0$ (a contradiction).

4 Here we assume $\alpha_{ij}$ is finite, see footnote 1.

Using Theorem 3 we initialize $A_i = \sigma(\theta_i)$ and $B_i = 1 - \sigma(\theta_i + W_i)$.

Using (6), at any stationary point we must have

$$q_i \ge 1/[1 + \exp(-\theta_i)/L_i]$$

where $L_i = \prod_{j \in N(i)} L_{ij}$. Intuitively, in an associative
model, if variable $i$ has neighbors $j$ which are likely to
be 1 (i.e. high $A_j$) then this pulls up the probability
that $i$ will be 1 (i.e. raises $A_i$).

Flipping all variables,

$$1 - q_i \ge 1/[1 + \exp(\theta_i + W_i)/U_i]$$

where $U_i = \prod_{j \in N(i)} U_{ij}$ with

$$e^{W_{ij}} \ge U_{ij} := 1 + \frac{\alpha_{ij} B_j}{1 + \alpha_{ij}(1 - A_i)(1 - B_j)}.$$

It is also possible to write this as

$$\sigma(\theta_i + \log L_i) \le q_i \le \sigma(\theta_i + W_i - \log U_i).$$

This establishes a message passing type of algorithm
for iteratively improving the bounds $\{A_i, B_i\}$. Repeat
until convergence:

new $A_i \leftarrow (1 + \exp(-\theta_i)/L_i)^{-1}$
new $B_i \leftarrow (1 + \exp(\theta_i + W_i)/U_i)^{-1}$
recompute $L_i$, $U_i$ using new $A_i$, $B_i$.

Lemma 9. At every iteration, all of $A_i$, $B_i$, $L_{ij}$, $U_{ij}$
monotonically increase.

Proof. All of the dependencies are monotonically increasing
on all inputs. The first iteration yields an
increase since each $L_{ij}, U_{ij} \ge 1$. □

Since $A_i + B_i \le 1$, each is bounded above and we
achieve monotonic convergence. Combining this with
the main global optimization approach can dramatically
reduce the range of values that need be considered,
leading to significant time savings. Convergence
is rapid even for large, densely connected graphs. Each
iteration takes $O(|\mathcal{E}|)$ time; a good heuristic is to run
for up to 20 iterations, terminating early if all parameters
improve by less than a threshold value. This adds
negligible time to the global optimization.
This procedure alone can produce impressive results.
For example, running on a 100-node graph with independent
random edge probability 0.04 (hence average
degree 4), each $W_{ij}$ and $\theta_i$ drawn randomly
from Uniform[0, 1] and then adjusting $\theta_i \leftarrow \theta_i - \sum_{j \in N(i)} W_{ij}/2$
in order to be unbiased, convergence
takes about 11 iterations yielding final average bracket
width of 0.05 after starting with average bracket width
of 0.40. Greater connectivity, higher edge strengths
and smaller individual node potentials make the problem
more challenging and may widen the returned final
brackets significantly.

3.5 BBP for general models 

A repulsive edge $(i, j)$ may always be flipped to associative
by flipping variable $j$, which flips its Bethe
bounds $A_j \leftrightarrow B_j$. Using Theorem 4 we can extend
the analysis above to run BBP on any model, see Algorithm 1.
Performance in terms of convergence speed
and final bracket width is similar for associative and
non-associative models.

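Algorithm 1 BBP for a general binary pairwise model
{Initialize}
for all $i \in \mathcal{V}$ do
    $W_i = \sum_{j \in N(i): W_{ij} > 0} W_{ij}$,
    $V_i = -\sum_{j \in N(i): W_{ij} < 0} W_{ij}$,
    $A_i = \sigma(\theta_i - V_i)$, $B_i = 1 - \sigma(\theta_i + W_i)$
end for
for all $(i, j) \in \mathcal{E}$ do
    $\alpha_{ij} = \exp(|W_{ij}|) - 1$
end for
repeat
    for all $i \in \mathcal{V}$ do
        $L_i = 1$, $U_i = 1$ {Initialize for this pass}
        for all $j \in N(i)$ do
            if $W_{ij} > 0$ then
                {Associative edge}
                $L_i \mathrel{*}= 1 + \frac{\alpha_{ij} A_j}{1 + \alpha_{ij}(1 - B_i)(1 - A_j)}$
                $U_i \mathrel{*}= 1 + \frac{\alpha_{ij} B_j}{1 + \alpha_{ij}(1 - A_i)(1 - B_j)}$
            else
                {Repulsive edge}
                $L_i \mathrel{*}= 1 + \frac{\alpha_{ij} B_j}{1 + \alpha_{ij}(1 - B_i)(1 - B_j)}$
                $U_i \mathrel{*}= 1 + \frac{\alpha_{ij} A_j}{1 + \alpha_{ij}(1 - A_i)(1 - A_j)}$
            end if
        end for
        $A_i = 1/(1 + \exp(-\theta_i + V_i)/L_i)$
        $B_i = 1/(1 + \exp(\theta_i + W_i)/U_i)$
    end for
until all $A_i$, $B_i$ changed by < THRESH or run MAXITER times
{Suggested THRESH = 0.002, MAXITER = 20}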
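
Algorithm 1 translates fairly directly into code. The following Python sketch is one possible reading of it under stated assumptions: the function name `bbp`, the dict-of-edges input format and the in-place (sequential) update order are choices made for this illustration, and the repulsive-edge updates are obtained by swapping $A_j \leftrightarrow B_j$ as described in section 3.5.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bbp(theta, W, thresh=0.002, max_iter=20):
    """Bethe bound propagation sketch.
    theta: list of unary parameters; W: dict {(i, j): weight}, i != j.
    Returns (A, B) with A[i] <= q_i <= 1 - B[i] at any Bethe stationary point."""
    n = len(theta)
    nbrs = [[] for _ in range(n)]
    for (i, j), w in W.items():
        if w != 0.0:  # zero-weight edges may be deleted
            nbrs[i].append((j, w))
            nbrs[j].append((i, w))
    Wpos = [sum(w for _, w in nb if w > 0) for nb in nbrs]
    Vneg = [-sum(w for _, w in nb if w < 0) for nb in nbrs]
    # Initial brackets from Theorem 4.
    A = [sigmoid(theta[i] - Vneg[i]) for i in range(n)]
    B = [1.0 - sigmoid(theta[i] + Wpos[i]) for i in range(n)]
    for _ in range(max_iter):
        change = 0.0
        for i in range(n):
            L = U = 1.0
            for j, w in nbrs[i]:
                a = math.exp(abs(w)) - 1.0
                # A repulsive edge is handled by flipping j: swap A_j and B_j.
                lo, hi = (A[j], B[j]) if w > 0 else (B[j], A[j])
                L *= 1.0 + a * lo / (1.0 + a * (1.0 - B[i]) * (1.0 - lo))
                U *= 1.0 + a * hi / (1.0 + a * (1.0 - A[i]) * (1.0 - hi))
            new_a = 1.0 / (1.0 + math.exp(-theta[i] + Vneg[i]) / L)
            new_b = 1.0 / (1.0 + math.exp(theta[i] + Wpos[i]) / U)
            change = max(change, abs(new_a - A[i]), abs(new_b - B[i]))
            A[i], B[i] = new_a, new_b
        if change < thresh:
            break
    return A, B
```

On a two-node model (a tree, where the Bethe stationary point coincides with the true marginals) the returned brackets can be compared against marginals computed by enumeration, which is a convenient sanity check for either edge sign.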



4 Higher derivatives & submodularity 

We first derive a novel result for the second derivatives 
of an edge which will be crucial later for bounding 
the error of the discretized global optimum and also 
will allow us to show that the discretized multi-label 
problem is submodular. 

4.1 Second derivatives for each edge 

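Theorem 10. For any edge $(i, j)$, for any $\alpha_{ij}$, writing
$f = f_{ij}$ and $\mu_{ab} = \mu_{ij}(a, b)$ from (2),

$$\frac{\partial^2 f}{\partial q_i^2} = \frac{1}{T_{ij}} q_j (1 - q_j),$$

$$\frac{\partial^2 f}{\partial q_i \partial q_j} = \frac{\partial^2 f}{\partial q_j \partial q_i} = \frac{1}{T_{ij}} (\mu_{01} \mu_{10} - \mu_{00} \mu_{11}),$$

$$\frac{\partial^2 f}{\partial q_j^2} = \frac{1}{T_{ij}} q_i (1 - q_i),$$

where $T_{ij} = q_i q_j (1 - q_i)(1 - q_j) - (\xi_{ij} - q_i q_j)^2 \ge 0$ with
equality only for $q_i$ or $q_j \in \{0, 1\}$. Further $\mu_{01} \mu_{10} - \mu_{00} \mu_{11} = q_i q_j - \xi_{ij}$
and has the sign of $-\alpha_{ij}$.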

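Proof. We begin with the same approach as [12] but
extend the analysis and derive stronger results.

For notational convenience add a third pseudo-dimension
restricted to the value 1. Let $y = (y_1, y_2, y_3)$
be the vector with components $y_1 = X_i$, $y_2 = X_j$ and
$y_3 = 1$, where $X_i, X_j \in \{0, 1\}$. Define $\pi(y) = \mu_{ij}(y_1, y_2)$,
and $\phi(y) = W_{ij}$ if $y = (1, 1, 1)$ or $\phi(y) = 0$ otherwise.
Let $r = (q_i, q_j, 1)$. Define function $h$ used in entropy
calculations as $h(z) = -z \log z$.

Consider (5) but instead of solving for $\xi_{ij}$ explicitly,
express $f$ as an optimization problem, minimizing free
energy subject to local consistency and normalization
constraints in order to use techniques from convex optimization.
We have $f(q_i, q_j) = g(r)$ where

$$g(r) = \min_{\pi} \sum_y \big(-\phi(y) \pi(y) - h(\pi(y))\big) \quad \text{s.t.} \sum_{y: y_k = 1} \pi(y) = r_k, \; k = 1, 2, 3. \tag{12}$$

The Lagrangian can be written as

$$L_r(\pi, \lambda) = \sum_y \big[(-\phi(y) - \langle y, \lambda \rangle) \pi(y) - h(\pi(y))\big] + \langle r, \lambda \rangle$$

and its derivative is

$$\frac{\partial L_r(\pi, \lambda)}{\partial \pi(y)} = -\phi(y) - \langle y, \lambda \rangle + 1 + \log \pi(y)$$

which yields a minimum at

$$\pi_\lambda(y) = \exp(\phi(y) + \langle y, \lambda \rangle - 1). \tag{13}$$

Since the minimization problem in (12) is convex
and satisfies the weak Slater's condition (the constraints
are affine), strong duality applies and $g(r) = \max_\lambda G(r, \lambda) = G(r, \lambda^*(r))$
where the dual is simply

$$G(r, \lambda) = \min_\pi L_r(\pi, \lambda) = -\sum_y \pi_\lambda(y) + \langle r, \lambda \rangle. \tag{14}$$

Let $D_k(r, \lambda) = \frac{\partial G(r, \lambda)}{\partial \lambda_k}$, then $D_k(r, \lambda^*) = 0$, $k = 1, 2, 3$.
Hence $\frac{\partial g}{\partial r_k} = \frac{\partial G}{\partial r_k}\big|_{\lambda^*} = \lambda^*_k$ using (14). Focusing on
our goal of obtaining second derivatives, we consider
$\frac{\partial^2 g}{\partial r_l \partial r_k} = \frac{\partial \lambda^*_k}{\partial r_l}$,
which we shall express in terms of

$$C_{kl} := \frac{\partial^2 G}{\partial \lambda_l \partial \lambda_k} = \frac{\partial D_k}{\partial \lambda_l}.$$

Differentiating $D_k(r, \lambda^*) = 0$ with respect to $r_l$,

$$0 = \frac{d D_k(r, \lambda^*)}{d r_l} = \frac{\partial D_k}{\partial r_l} + \sum_{p=1}^{3} \frac{\partial D_k}{\partial \lambda_p} \frac{\partial \lambda^*_p}{\partial r_l}, \quad k, l = 1, 2, 3.$$

Considering (14), $\frac{\partial D_k}{\partial r_l} = \frac{\partial^2 G}{\partial r_l \partial \lambda_k} = \delta_{kl}$, hence
$-\delta_{kl} = \sum_p C_{kp} \frac{\partial \lambda^*_p}{\partial r_l}$. Thus $\frac{\partial \lambda^*_k}{\partial r_l} = -[C^{-1}]_{kl}$. Using its
definition and (14), we have

$$C_{kl} = \frac{\partial^2 G}{\partial \lambda_l \partial \lambda_k} = \frac{\partial}{\partial \lambda_l} \Big(r_k - \sum_{y: y_k = 1} \pi_\lambda(y)\Big) = -\sum_{y: y_k = 1, y_l = 1} \pi_\lambda(y).$$

Earlier work [12] stopped here, recognizing that
$\det C \le 0$. We more precisely characterize this matrix

$$C = -\begin{pmatrix} \mu_{10} + \mu_{11} & \mu_{11} & \mu_{10} + \mu_{11} \\ \mu_{11} & \mu_{01} + \mu_{11} & \mu_{01} + \mu_{11} \\ \mu_{10} + \mu_{11} & \mu_{01} + \mu_{11} & 1 \end{pmatrix}. \tag{15}$$

Recall constraints $\mu_{00} + \mu_{01} + \mu_{10} + \mu_{11} = 1$; $\mu_{01} + \mu_{11} = q_j$,
$\mu_{10} + \mu_{11} = q_i$. Note $C$ is symmetric.

Applying our result above and using Cramer's rule,

$$\frac{\partial^2 f}{\partial q_i^2} = \frac{\partial^2 g}{\partial r_1^2} = \frac{1}{-\det C} (\mu_{01} + \mu_{11})(\mu_{00} + \mu_{10}) = \frac{q_j (1 - q_j)}{-\det C},$$

$$\frac{\partial^2 f}{\partial q_i \partial q_j} = \frac{\partial^2 f}{\partial q_j \partial q_i} = \frac{\partial^2 g}{\partial r_1 \partial r_2} = \frac{\mu_{01} \mu_{10} - \mu_{00} \mu_{11}}{-\det C},$$

$$\frac{\partial^2 f}{\partial q_j^2} = \frac{\partial^2 g}{\partial r_2^2} = \frac{1}{-\det C} (\mu_{10} + \mu_{11})(\mu_{00} + \mu_{01}) = \frac{q_i (1 - q_i)}{-\det C}.$$

Using (15) and simplifying, we obtain $-\det C = \mu_{00} \mu_{01} \mu_{10} + \mu_{01} \mu_{10} \mu_{11} + \mu_{10} \mu_{11} \mu_{00} + \mu_{11} \mu_{00} \mu_{01}$. By
Lemma 8 this is strictly $> 0$ unless $q_i$ or $q_j \in \{0, 1\}$.
Substituting in terms from (2) and simplifying establishes
$-\det C = T_{ij}$ from the statement of the theorem,
and $\mu_{01} \mu_{10} - \mu_{00} \mu_{11} = q_i q_j - \xi_{ij}$. The sign
follows from Lemma 2 or observing from (13) that
$\frac{\mu_{00} \mu_{11}}{\mu_{01} \mu_{10}} = e^{W_{ij}} = \alpha_{ij} + 1$. □

Note that stronger edge interactions lead through
higher $|\alpha_{ij}|$ to greater $(\xi_{ij} - q_i q_j)^2$ and hence larger
second derivatives.

4.2 Third derivatives for each edge

Lemma 11 (Finite 3rd derivatives). For any edge
$(i, j)$ with $\alpha_{ij} > 0$, if $q_i, q_j \in (0, 1)$ then all third
derivatives exist and are finite.

Proof. Using Theorem 10, noting $T_{ij} > 0$ strictly and
considering (2), it is sufficient to show $\frac{\partial \xi_{ij}}{\partial q_k}$ is finite.
We may assume $k \in \{i, j\}$ else the derivative is 0 and
by symmetry need only check $\frac{\partial \xi_{ij}}{\partial q_i}$. Differentiating (3),

$$\frac{\partial \xi_{ij}}{\partial q_i} = \frac{\alpha_{ij}(q_j - \xi_{ij}) + q_j}{1 + \alpha_{ij}(q_i - \xi_{ij} + q_j - \xi_{ij})},$$

clearly finite for $\alpha_{ij} > 0$ since recalling (2), $q_i - \xi_{ij}$
and $q_j - \xi_{ij}$ are elements of the pseudo-marginal and
hence are non-negative (or use Lemma 7). □

4.3 Submodularity

Theorem 12. If a binary pairwise MRF is submodular
on an edge $(i, j)$, i.e. $\alpha_{ij} > 0$, then the multi-label
discretized MRF for any discretization $\mathcal{D}$ is submodular
for that edge. In particular, if the MRF is fully
associative/submodular, i.e. $\alpha_{ij} > 0 \; \forall (i, j) \in \mathcal{E}$, then
the multi-label discretized MRF is fully submodular for
any discretization.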



Proof. For any edge $(i, j)$ let $f$ be the pairwise function
$f_{ij}$ from (5) and note the submodularity requirement
(7). Let $x = (x_1, x_2)$, $y = (y_1, y_2)$ be
any points in $[0, 1]^2$. Define $s(x, y) = (s_1, s_2) = (\min(x_1, y_1), \min(x_2, y_2))$,
and $t(x, y) = (t_1, t_2) = (\max(x_1, y_1), \max(x_2, y_2))$.
Let $g(x, y) = f(s_1, s_2) + f(t_1, t_2) - f(s_1, t_2) - f(t_1, s_2)$,
call this the submodularity
of the rectangle defined by $x, y$. We must show
$g(x, y) \le 0$. Note $f$ is continuous in $[0, 1]^2$ hence so also
is $g$. We shall show that $\forall x, y \in (0, 1)^2$, $g(x, y) \le 0$;
then the result follows by continuity.

Assume $x, y \in (0, 1)^2$. Consider derivatives of $f$ in
the compact set $R = [s_1, t_1] \times [s_2, t_2]$. Using (6) and
Lemma 8, first derivatives exist and are bounded. By
Theorem 10 and Lemma 11 the same holds for second
and third derivatives. Further, Theorem 10 and
Lemma 6 show that $\frac{\partial^2 f}{\partial q_i \partial q_j} = \frac{\partial^2 f}{\partial q_j \partial q_i} < 0$.

If a rectangle is sliced fully along each dimension so
as to be subdivided into sub-rectangles then summing
the submodularities of all the sub-rectangles, internal
terms cancel and we obtain the submodularity of the
original rectangle.

Hence there exists an $\epsilon$ such that if we subdivide the
rectangle defined by $x, y$ into sufficiently small sub-rectangles
with sides $< \epsilon$ and apply Taylor's theorem
up to second order with the remainder expressed in
terms of the third derivative evaluated in the interval,
then the second order terms dominate and the submodularity
of each small sub-rectangle is $< 0$. Summing
over all sub-rectangles provides the result. □
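
Theorem 12 can be spot-checked numerically on a particular mesh. The sketch below is an illustration only; the names `f_edge` and `mesh_is_submodular` and the example meshes are assumptions. It evaluates $f_{ij}$ from (5) at each mesh point and tests (7) on every block of adjacent mesh values, which suffices for a two-dimensional ordered table.

```python
import math

def h(z):
    """Entropy term -z log z, with h(0) = 0."""
    return -z * math.log(z) if z > 0.0 else 0.0

def f_edge(qi, qj, w):
    """Pairwise term f_ij = -W_ij * xi_ij - S_ij from (5), for w != 0."""
    a = math.exp(w) - 1.0
    b = 1.0 + a * (qi + qj)
    disc = b * b - 4.0 * a * (1.0 + a) * qi * qj
    x = (b - math.sqrt(max(disc, 0.0))) / (2.0 * a)  # root chosen as in (3)
    mu = (1.0 + x - qi - qj, qj - x, qi - x, x)       # entries of (2)
    return -w * x - sum(h(max(m, 0.0)) for m in mu)

def mesh_is_submodular(w, Di, Dj, tol=1e-9):
    """Test (7) for f_edge on the mesh Di x Dj (both sorted ascending)."""
    for a in range(len(Di) - 1):
        for b in range(len(Dj) - 1):
            g = (f_edge(Di[a], Dj[b], w) + f_edge(Di[a + 1], Dj[b + 1], w)
                 - f_edge(Di[a], Dj[b + 1], w) - f_edge(Di[a + 1], Dj[b], w))
            if g > tol:
                return False
    return True
```

With an associative weight such as `w = 1.5` the check passes on an uneven mesh, whereas for `w = -1.5` it fails, consistent with the sign of the mixed second derivative in Theorem 10.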

4.4 Second derivatives for singleton terms

Let $f_i(q_i)$ be the singleton terms from (4) for $X_i$. The
only non-zero derivatives are with respect to $q_i$.

$$f_i(q_i) = -\theta_i q_i + (z_i - 1) S_i(q_i)$$

$$\frac{\partial f_i}{\partial q_i} = -\theta_i - (z_i - 1)[\log q_i - \log(1 - q_i)]$$

$$\frac{\partial^2 f_i}{\partial q_i^2} = -(z_i - 1) \frac{1}{q_i (1 - q_i)} \le 0 \text{ for a connected graph.} \tag{16}$$

Hence, $-\frac{z_i - 1}{\eta_i (1 - \eta_i)} \le \frac{\partial^2 f_i}{\partial q_i^2} \le 0$, $\eta_i = \min(A_i, B_i)$.

5 Approximating the Global Optimum for an Associative Model

We now assemble earlier results to form the complete
matrix $H$ of second derivatives of the Bethe free energy
$F$ and use this to bound the error between the discretized
optimum and the global Bethe optimum. In
this section we assume the model is associative. Define
the Bethe box to be the orthotope (sometimes called a
hyper-cuboid) given by $q_i \in [A_i, 1 - B_i] \; \forall i \in \mathcal{V}$.

At the optimum (or any stationary point), all first
derivatives are zero. If we choose our discretization
mesh $\mathcal{D}$ to be sufficiently fine then we can be sure that
a point in the mesh is within distance $\delta$ of a true optimum.
In particular, if we choose each $\mathcal{D}_i$ so that in
the $q_i$ dimension every point in $[A_i, 1 - B_i]$ is within
distance $\gamma$, then $\delta^2 \le n \gamma^2$.

Using a first order Taylor expansion of $F$ around a true
optimum, with the remainder expressed in terms of the
second derivative, the error of our discretized optimum
versus the true Bethe optimum is $\le \frac{1}{2} \Lambda \delta^2$, where $\Lambda$ is the
largest eigenvalue of $H$ evaluated at some intermediate
point, which we shall bound. Observe that the Bethe
optimum (any stationary point) must lie within the
Bethe box, and hence we may assume also that all
mesh points are inside since it would be pointless to
check outside it. We shall bound the largest eigenvalue
of $H$ anywhere within the Bethe box.^5

Note that our error is one-sided since our discretized
optimum can never be better than the true optimum.
This may facilitate further analysis to find a better
approximation by using points in the neighborhood to
estimate the likely error.

5.1 Complete matrix of second derivatives

Theorem 10 and (16) provide all the terms.

Lemma 13. All entries on the main diagonal of $H$
are strictly positive, all others are $\le 0$.

Proof. Apply Theorem 10. If $(i, j) \in \mathcal{E}$ then $H_{ij} = \frac{1}{T_{ij}} (q_i q_j - \xi_{ij}) \le 0$.
If $(i, j) \notin \mathcal{E}$, $i \ne j$ then $H_{ij} = 0$.
On the main diagonal

$$H_{ii} = -\frac{z_i - 1}{q_i (1 - q_i)} + \sum_{j \in N(i)} \frac{q_j (1 - q_j)}{T_{ij}} \tag{17}$$

$$\ge \frac{1 - z_i}{q_i (1 - q_i)} + \sum_{j \in N(i)} \frac{q_j (1 - q_j)}{q_i q_j (1 - q_i)(1 - q_j)} = \frac{1}{q_i (1 - q_i)} > 0.$$

□



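5.2 Max eigenvalue & complexity bound

We have shown that $H$ is a real symmetric matrix with
strictly positive main diagonal and all other entries
$\le 0$. To further bound the entries we derive a lower
bound for $T_{ij}$ at any point in the Bethe box. Define
$K_{ij} = \eta_i \eta_j (1 - \eta_i)(1 - \eta_j) \frac{2 \alpha_{ij} + 1}{(\alpha_{ij} + 1)^2}$. All terms are known
from the data prior to the discrete optimization.

Lemma 14. At any point in the Bethe box, $T_{ij} \ge K_{ij}$.

Proof. Using Theorem 10 and Lemma 7,

$$T_{ij} \ge q_i q_j (1 - q_i)(1 - q_j) - \left(\frac{\alpha_{ij} m (1 - M)}{1 + \alpha_{ij}}\right)^2 \ge q_i q_j (1 - q_i)(1 - q_j) \left[1 - \left(\frac{\alpha_{ij}}{1 + \alpha_{ij}}\right)^2\right].$$

The bracket equals $(2 \alpha_{ij} + 1)/(1 + \alpha_{ij})^2$ and $q_i (1 - q_i) \ge \eta_i (1 - \eta_i)$
within the Bethe box, which gives the result. □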



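Theorem 15. At any point in the Bethe box, each entry $H_{ij}$ satisfies $-a \le H_{ij} \le b$ where

$$a = \max_{(i,j) \in \mathcal{E}} \frac{\alpha_{ij}(\alpha_{ij} + 1)}{4(2\alpha_{ij} + 1)\, \eta_i \eta_j (1 - \eta_i)(1 - \eta_j)}, \qquad b = \max_{i \in \mathcal{V}} \frac{1}{\eta_i(1 - \eta_i)} \left(1 - d_i + \sum_{j \in N(i)} \frac{(\alpha_{ij} + 1)^2}{2\alpha_{ij} + 1}\right).$$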
¹ This value can also be used to find an approximately stationary point [18] if required, by considering the Taylor expansion of F' around a stationary point.
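Proof. For any edge $(i,j) \in \mathcal{E}$,

$$-H_{ij} = \frac{\xi_{ij} - q_i q_j}{T_{ij}} \le m(1 - M)\, \frac{\alpha_{ij}}{1 + \alpha_{ij}}\, \frac{1}{K_{ij}} \le \frac{1}{4}\, \frac{\alpha_{ij}}{1 + \alpha_{ij}}\, \frac{1}{K_{ij}}.$$

Using (17) and the expression from the proof of Lemma 14,

$$H_{ii} = \frac{1 - d_i}{q_i(1 - q_i)} + \sum_{j \in N(i)} \frac{q_j(1 - q_j)}{T_{ij}} \le \frac{1}{q_i(1 - q_i)} \left(1 - d_i + \sum_{j \in N(i)} \frac{(\alpha_{ij} + 1)^2}{2\alpha_{ij} + 1}\right).$$

□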
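A minimal sketch of how the two constants could be evaluated from data, assuming the reading $a = \max_{(i,j)} \frac{\alpha_{ij}(\alpha_{ij}+1)}{4(2\alpha_{ij}+1)\eta_i\eta_j(1-\eta_i)(1-\eta_j)}$ and $b = \max_i \frac{1}{\eta_i(1-\eta_i)}\bigl(1 - d_i + \sum_{j \in N(i)} \frac{(\alpha_{ij}+1)^2}{2\alpha_{ij}+1}\bigr)$. The function name `entry_bounds` and the dictionary representation of edges are hypothetical.

```python
def entry_bounds(eta, alpha):
    """eta: list of eta_i in (0, 1).
    alpha: dict {(i, j): alpha_ij >= 0} with one key per undirected edge.
    Returns (a, b) such that -a <= H_ij <= b under the reading stated above."""
    n = len(eta)
    s = [e * (1.0 - e) for e in eta]        # eta_i * (1 - eta_i)
    degree = [0] * n
    diag_sum = [0.0] * n                    # sum over neighbours of (alpha+1)^2 / (2 alpha + 1)
    a = 0.0
    for (i, j), al in alpha.items():
        # off-diagonal magnitude bound for edge (i, j)
        a = max(a, al * (al + 1.0) / (4.0 * (2.0 * al + 1.0) * s[i] * s[j]))
        term = (al + 1.0) ** 2 / (2.0 * al + 1.0)
        for v in (i, j):
            degree[v] += 1
            diag_sum[v] += term
    # diagonal bound, maximized over all variables
    b = max((1.0 - degree[i] + diag_sum[i]) / s[i] for i in range(n))
    return a, b
```

For a single edge with $\eta_1 = \eta_2 = 0.5$ and $\alpha_{12} = 1$, this gives $a = 8/3$ and $b = 16/3$.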



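Since $\alpha_{ij} + 1 \le 2\alpha_{ij} + 1$ we have the corollary that

$$H_{ii} \le \frac{1}{\eta_i(1 - \eta_i)} \left(1 + \sum_{j \in N(i)} \alpha_{ij}\right).$$

We remark that at any minimum of the Bethe free energy, all eigenvalues are $\ge 0$, so at these locations the maximum eigenvalue $\le \operatorname{Tr} H \le \sum_{i \in \mathcal{V}} \frac{1}{\eta_i(1 - \eta_i)} + \sum_{(i,j) \in \mathcal{E}} \alpha_{ij} \left(\frac{1}{\eta_i(1 - \eta_i)} + \frac{1}{\eta_j(1 - \eta_j)}\right)$.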



In order to bound the largest eigenvalue, we may use recent results such as Corollary 2 in [29], although we suspect that the particular properties of H given in Lemma 13 may admit more precise bounds.

Here we use the following elementary bound which allows us to relate to the concepts of sparsity or maximum degree as in [18]. Let $\Sigma$ be the proportion of non-zero entries in H and $\Delta$ the maximum degree, so the number of non-zero entries is $n^2\Sigma \le n + n\Delta \Rightarrow \Sigma \le \frac{1 + \Delta}{n}$, since we have the main diagonal terms and two entries for each edge. Let $\Omega = \max(a, b)$ from Theorem 15; then we have

$$\Lambda \le \sqrt{\operatorname{tr}(H^T H)} \le \sqrt{\Sigma n^2 \Omega^2} = n\Omega\sqrt{\Sigma}. \qquad (18)$$
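The chain in (18) can be checked numerically on a small example. The sketch below assumes H is supplied as a list of rows; the names `frobenius_norm` and `sparsity_bound` are hypothetical. It uses the fact that for a real symmetric matrix the sum of squared eigenvalues equals the sum of squared entries, so the largest eigenvalue is at most the Frobenius norm.

```python
import math

def frobenius_norm(H):
    """sqrt(tr(H^T H)), i.e. the square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in H for x in row))

def sparsity_bound(H):
    """n * Omega * sqrt(Sigma), where Sigma is the proportion of non-zero
    entries and Omega is the largest absolute entry of H."""
    n = len(H)
    nonzero = sum(1 for row in H for x in row if x != 0)
    sigma = nonzero / (n * n)
    omega = max(abs(x) for row in H for x in row)
    return n * omega * math.sqrt(sigma)
```

For the tridiagonal matrix with 2 on the diagonal and -1 on the off-diagonals (n = 3), the largest eigenvalue is $2 + \sqrt{2}$, the Frobenius norm is 4, and the sparsity bound is $2\sqrt{7}$, so the three quantities are ordered as (18) predicts.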



Returning to the reasoning at the start of this section, note that by using $N_i$ points in $\mathcal{D}_i$ we can ensure $\gamma \le (1 - B_i - A_i)/(N_i + 1)$. Using worst case Bethe bounds ($A_i = B_i = 0$) we achieve maximum $\gamma$ distance in each dimension with $\frac{1}{\gamma}$ points for each variable, so the total number of nodes in the max-flow graph we need to solve the multi-label graph cuts problem is $N \le \frac{n}{\gamma}$. We require $n\gamma^2 \le \frac{2\epsilon}{\Lambda}$, hence $N^2 \ge \frac{n^3\Lambda}{2\epsilon}$. Using (18) it is sufficient if $N^2 \ge \frac{n^4\Omega\sqrt{\Sigma}}{2\epsilon}$. Graph cuts is a max-flow algorithm for which there are push-relabel methods guaranteed to run in time $O(N^3)$ [7]. Hence our algorithm has worst case run time of $O\left(\left(\frac{n^4\Omega\sqrt{\Sigma}}{2\epsilon}\right)^{3/2}\right) = O(n^6\Sigma^{3/4}\Omega^{3/2}\epsilon^{-3/2})$. However, in practice runtime for this class of problem using the Boykov-Kolmogorov algorithm [1] often approaches $O(N)$ for dramatically improved performance.
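The node-count requirement can be turned into a small calculation. This sketch assumes the worst-case setting described above (N = n / gamma nodes in total) and the sufficient condition $N^2 \ge n^4\Omega\sqrt{\Sigma}/(2\epsilon)$; the function names are hypothetical.

```python
import math

def required_nodes(n, omega, sigma, eps):
    """Smallest integer N with N^2 >= n^4 * omega * sqrt(sigma) / (2 * eps)."""
    return math.ceil(math.sqrt(n ** 4 * omega * math.sqrt(sigma) / (2.0 * eps)))

def error_bound(n, omega, sigma, num_nodes):
    """(1/2) * Lambda_bound * n * gamma^2, using Lambda_bound = n * omega * sqrt(sigma)
    from (18) and the worst-case spacing gamma = n / num_nodes."""
    lam = n * omega * math.sqrt(sigma)
    gamma = n / num_nodes
    return 0.5 * lam * n * gamma ** 2
```

For example, with n = 10, Omega = 4, Sigma = 0.3 and epsilon = 0.01, `required_nodes` returns 1047; one node fewer would push the guaranteed error just above epsilon.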

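Note that $\Omega$ above may depend on n. For our analysis in this paper we assumed the reparameterization given earlier, but a natural specification avoiding bias is to provide maximum possible values $W^*$ and $\theta^*$ with

$$0 \le W_{ij} \le W^*\ \forall (i,j) \in \mathcal{E}, \qquad |\theta_i| \le \theta^*\ \forall i \in \mathcal{V}.$$

The required reparameterization for edge $(i,j)$ takes $\theta_i \leftarrow \theta_i - W_{ij}/2$, hence reparameterizing all edges takes $\theta_i \leftarrow \theta_i - \sum_{j \in N(i)} W_{ij}/2$. A sufficient condition for $\frac{1}{\eta_i(1 - \eta_i)}$ to have a polynomial upper bound is that the maximum degree $\Delta := \max_{i \in \mathcal{V}} d_i = O(\log n)$, the same degree restriction as in [18]. In this case, $\frac{1}{\eta_i(1 - \eta_i)} = O(\exp(\theta^* + \Delta W^*/2))$.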
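The effect of the reparameterization on the unary terms can be sketched as follows, assuming each edge (i, j) shifts both endpoints by $W_{ij}/2$. The function names and the dictionary representation of edge weights are hypothetical.

```python
def reparameterize(theta, W):
    """theta: list of unary parameters theta_i.
    W: dict {(i, j): W_ij} with one key per undirected edge.
    Returns theta_i - sum_{j in N(i)} W_ij / 2 for every i."""
    shift = [0.0] * len(theta)
    for (i, j), w in W.items():
        shift[i] += w / 2.0
        shift[j] += w / 2.0
    return [t - s for t, s in zip(theta, shift)]

def max_degree(n, W):
    """Maximum number of edges incident to any of the n variables."""
    degree = [0] * n
    for i, j in W:
        degree[i] += 1
        degree[j] += 1
    return max(degree)
```

If $|\theta_i| \le \theta^*$ and $0 \le W_{ij} \le W^*$ before the shift, then afterwards every unary term has magnitude at most $\theta^* + \Delta W^*/2$, which is the exponent appearing in the bound above.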

Regarding Theorem 15, now $a = O(\exp(W^* + 2\theta^* + \Delta W^*))$, $b = O(\Delta\exp(W^* + \theta^* + \Delta W^*/2))$ with $\Omega = \max(a, b)$ and $\Sigma = O(\Delta/n)$, yielding the polynomial result.



6 Conclusion & Extensions 



To our knowledge, we have proved the first PTAS for 
the global optimum of the Bethe free energy of an 
associative binary pairwise MRF. In doing so we de- 
rived a range of other results, including several for gen- 
eral edges and models (associative or not), which may 
prove useful in their own right, including Bethe bound 
propagation. 

Although our algorithm is only weakly polynomial, we 
are not sure if it is possible to do better. Note that 
if we make no restriction on input parameters, then 
potentially $\alpha$ values could be infinite, corresponding
to probability distributions with exactly zero proba- 
bility for some states (which may be reasonable), and 
this will lead to infinite derivatives as some pseudo- 
marginal entries will be driven to 0. 

[23] has shown that graph cuts is in a strong sense
equivalent to max-product belief propagation with 
careful scheduling and damping. Together with our 
result this shows an interesting link between max- 
product and sum-product techniques. One direction 
to explore is how sum-product belief propagation fares 
using a scheme similar to [23].

We note that our approach immediately also applies 
to approximating optimum mean field marginals. In 
addition, it may readily extend to allow approximate 
marginal inference for multi-label and third order sub- 
modular MRFs, both of which can be mapped to 
equivalent associative binary pairwise MRFs [El [14] . 